=====================================================================
This test suite and set of compressors are intended for
research purposes only.

For any problems, contact: Paolo Ferragina, ferragina@di.unipi.it

@Copyright, Ferragina 2009
======================================================================


0. Description of the software

The directory you have downloaded contains:

- 7za: the LZMA executable, downloaded from http://www.7-zip.org/.
       Sources in 7za/

- bst, unbst: the executables of the compression booster (bzip with unbounded window).
       Downloaded from http://web.unipmn.it/~manzini/boosting/
       Sources in booster/

- BMIcompress, BMIdecompress: the executables of the bmi-preprocessor of
       Bentley and McIlroy (2001). Downloaded from http://www.cs.dartmouth.edu/~doug/jlbcompress.tar
       Sources in bmi/


- ppdmi, unpdmi: the implementation of PPMonster by D. Shkarin (2002).


You also have a set of Perl scripts that run the compressors over WARC files,
subdividing them into blocks of pages.

- blocked_compress.pl <blockSize> <compressorID> <numBlocksToCompress> <WriteCompressed>

  The file (in WARC format) to be compressed is passed via STDIN.

  Parameter #1 indicates the size of the block in MBs. So 1 means 1MB.
  Parameter #2 specifies the compressor to be used. The subset that you have here available is:
     0 = gzip -9, 1 = bzip2 -9, 2 = ppdmi, 3 = booster, 15 = lzma, 16 = bmi+gz, 18 = bmi+lzma.
  Parameter #3 indicates how many blocks to compress (0 = all)
  Parameter #4 indicates 1 = write the compressed file, 0 = silent mode (no writing)

  During compression the script prints some statistics to STDOUT and writes a more
  detailed log of the compression process to the file stats.txt.
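
  To make the block-wise scheme concrete, here is a minimal Python sketch. It is not
  the Perl script itself. It rests on several assumptions: it cuts the input at fixed
  byte offsets rather than at page boundaries, it takes 1MB to mean 2^20 bytes, and it
  uses zlib at level 9 as a stand-in for compressor 0 (gzip -9). The function name
  compress_in_blocks is hypothetical.

```python
import zlib


def compress_in_blocks(data: bytes, block_size_mb: float = 1, num_blocks: int = 0):
    """Compress `data` one block at a time and return the compressed blocks.

    block_size_mb: block size in MB (assumed here to be 2**20 bytes).
    num_blocks:    how many blocks to compress; 0 means all of them.
    """
    block_len = int(block_size_mb * 1024 * 1024)
    compressed = []
    for start in range(0, len(data), block_len):
        if num_blocks and len(compressed) >= num_blocks:
            break  # stop early, mirroring <numBlocksToCompress>
        block = data[start:start + block_len]
        # zlib level 9 stands in for "gzip -9" (an assumption of this sketch)
        compressed.append(zlib.compress(block, 9))
    return compressed


def compression_ratio(data: bytes, compressed) -> float:
    """Total compressed size divided by the original size."""
    return sum(len(b) for b in compressed) / len(data)
```

  Because every block is compressed independently, a larger block size lets the
  compressor exploit repetitions between pages that are farther apart, at the cost of
  having to decompress more data to reach a single page.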


- blocked_decompress.pl <WriteDecompressed>
   
  The file to be decompressed is read from the auxiliary files: _compr-file_.out and _compr-sizes.out
  
  Parameter #1 indicates 1 = write the decompressed file, 0 = silent mode (no writing)


- CSAtest_blocked.pl <blockSize> <BuildSearch> <SampleDist>

  The file (in WARC format) to be compressed is passed via STDIN.

  Parameter #1 indicates the size of the block in MBs. So 1 means 1MB.
  Parameter #2 specifies 0 = build the CSA, 1 = search the CSA
  Parameter #3 specifies the sample distance to be assigned to -p (128, 256, 1024, ... in our experiments)

  I suggest using only the "building" mode, because the "searching" mode needs further
  files to test the CSA.

 
- random_permuting.pl
  
  Reads the file (in WARC format) from STDIN and writes two output files that specify a random
  permutation of the pages (_PermutedPos_.txt and _PageSizes_.txt).
       
- url_sorting.pl

  Reads the file (in WARC format) from STDIN and writes two output files that specify the URL-based
  permutation of the pages (_PermutedPos_.txt and _PageSizes_.txt).


- permute.pl

  Reads the pages to be permuted from STDIN, and the permutation from the two
  files: _PermutedPos_.txt and _PageSizes_.txt
  It then applies this permutation, writing out the new WARC-formatted file
  of the permuted pages, named _FinalPermuted_.txt
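
  The three permutation scripts above can be illustrated with a short Python sketch.
  It is only a sketch under stated assumptions: the layout of _PermutedPos_.txt and
  _PageSizes_.txt is not described here, so the code works on in-memory lists instead
  of those files, and it assumes that URL-based ordering means plain lexicographic
  order of the URL strings. All function names are hypothetical.

```python
import random


def random_permutation(num_pages: int, seed=None):
    """Return the page indices 0..num_pages-1 in random order."""
    rng = random.Random(seed)
    order = list(range(num_pages))
    rng.shuffle(order)
    return order


def url_sorted_permutation(urls):
    """Return the page indices ordered by URL.

    Plain lexicographic string order is an assumption of this sketch.
    """
    return sorted(range(len(urls)), key=lambda i: urls[i])


def apply_permutation(pages, order):
    """Return the pages rearranged so that position k holds pages[order[k]]."""
    return [pages[i] for i in order]
```

  Sorting by URL tends to place pages of the same site next to each other, so similar
  pages are more likely to fall into the same compression block. A random permutation
  gives a baseline to compare against.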


- stripping_html.pl

  Reads the file to be cleaned from STDIN and writes the cleaned file to STDOUT.
  [You must install, with root privileges, the Perl package HTML-Strip-1.06 included in this directory.]
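
  For readers who want to see what stripping HTML involves, here is a minimal Python
  sketch built on the standard html.parser module. It swaps in a different technique
  and does not reproduce the behaviour of HTML-Strip or of stripping_html.pl. Dropping
  the content of script and style elements and collapsing whitespace are assumptions
  of this sketch. The names TextExtractor and strip_html are hypothetical.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect the text of a page, skipping <script> and <style> content."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.chunks.append(data)


def strip_html(text: str) -> str:
    """Return the text of `text` with tags removed and whitespace collapsed."""
    parser = TextExtractor()
    parser.feed(text)
    parser.close()  # flush any text still buffered by the parser
    return " ".join("".join(parser.chunks).split())
```

  To mimic the STDIN/STDOUT interface, the function could be called as
  strip_html(sys.stdin.read()) and its result written to sys.stdout.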



